We conduct a detailed investigation of correlations between real-time expressions of individuals
made across the United States and a wide range of emotional, geographic, demographic, and health 
characteristics. We do so by combining (1) a massive, geo-tagged data set comprising over 80 million 
words generated over the course of several recent years on the social network service Twitter and (2) 
annually-surveyed characteristics of all 50 states and close to 400 urban populations. Among many 
results, we generate taxonomies of states and cities based on their similarities in word use; estimate 
the happiness levels of states and cities; correlate highly-resolved demographic characteristics with 
happiness levels; and connect word choice and message length with urban characteristics such as 
education levels and obesity rates. Our results show how social media may be used to
estimate real-time levels of, and changes in, population-level measures such as obesity rates.



I. INTRODUCTION 

With vast quantities of real-time, fine-grained data
describing everything from transportation dynamics and
resource usage to social interactions, the science of
cities has entered the realm of the data-rich fields. While
much work and development lies ahead, the opportu- 
nity to scientifically engage with urban phenomena has 
now become broadly available to quantitatively-minded 
researchers [5]. And with over half the world's population 
now living in urban areas, and this proportion continu- 
ing to grow, cities, long central to human society, will 
only become increasingly more so [22]. Our focus here 
concerns one of the many important questions we are led 
to continuously address about cities: how does living in 
urban areas relate to well-being? Such an undertaking is 
part of a general program seeking to quantify and explain 
the evolving cultural character — the stories — of cities, as 
well as geographic places of larger and smaller scales. 

Numerous studies on well-being are published every 
year. The UN's 2012 World Happiness Report attempts 
to quantify happiness on a global scale using a 'Gross 
National Happiness' index which uses data on rural- 
urban residence and other factors [28]. In the US, Gallup 
and Healthways produce a yearly report on the well- 
being of different cities, states and congressional dis- 
tricts [19], and maintain a well-being index based on con- 
tinual polling and survey data [3]. Other countries are 
beginning to produce metrics measuring well-being: in
2012, surveys measuring national well-being and how it
relates to both health and where people live were conducted
in the United Kingdom by the Office for National
Statistics [4, 27] and in Australia by Fairfax Media and
Lateral Economics [16].

While these and other approaches to quantifying the 
sentiment of a city as a whole rely almost exclusively 
on survey data, there is now a range of complementary, 
remote-sensing methods available to researchers. The 
explosion of the amount and availability of data relat- 
ing to social network use in the past 15 years has driven 
a rapid increase in the application of data-driven tech- 
niques to the social sciences and sentiment analysis of 
large-scale populations. 

Our overall aim in this paper is to investigate how 
geographic place correlates with and potentially influ- 
ences societal levels of happiness. In particular, after 
first examining happiness dynamics at the level of states, 
we will explore urban areas in the United States in depth, 
and ask if it is possible to (a) measure the overall aver- 
age happiness of people located in cities; and (b) explain 
the variation in happiness across different cities. Our 
methodology for answering the first question uses word 
frequency distributions collected from a large corpus of 
geolocated messages or 'tweets' collected from Twitter, 
with individual words scored for their happiness independently
by users of Amazon's Mechanical Turk service [2].
This technique was introduced by Dodds and Danforth 
(2009) [10] and greatly expanded upon in Dodds et al. 
(2011) [11], as well as tested for robustness and sensi- 
tivity. In attempting to answer the second question of 
happiness variability, we examine how individual word 
usage correlates with happiness and various social and 
economic factors. To do this we use the 'word shift graph' 
technique developed in [10, 11], as well as correlate word 
usage frequencies with traditional city-level census sur- 
vey data. As we will show, the combination of these 
techniques produces significant insights into the character
of different cities and places.

We structure our paper as follows. In Section II, we
describe the data sets and our methodology for measur- 
ing happiness. In Section III we measure the happiness 
of different states and cities and determine the happiest 
and saddest states and cities in the US, with some anal- 
ysis of why places vary with respect to this measure. In 
Section IV we compare our results for cities with census 
data, correlating happiness and word usage with common 
economic and social measures. We also use the word fre- 
quency distributions to group cities by their similarities 
in observed word use. We conclude with a discussion in 
Section V. 



II. DATA AND METHODOLOGY 

We examine a corpus of over 10 million geotagged 
tweets gathered from 373 urban areas in the contigu- 
ous United States during the calendar year 2011. This 
corpus is a subset of Twitter's garden hose feed, and 
represents roughly 10% of all geotagged tweets posted 
in 2011. Urban areas are defined by the 2010 United 
States Census Bureau's MAF/TIGER (Master Address 
File/Topologically Integrated Geographic Encoding and 
Referencing) database [9]. See Appendix A for a more 
detailed description of the data set as well as an explo- 
ration of the relationship between area and perimeter, or 
fractal dimension, of these cities. 

To measure sentiment (hereafter happiness) in these 
areas from the corpus of words collected, we use the 
Language Assessment by Mechanical Turk (LabMT) 
word list (available online in the supplementary material 
of [11]), assembled by combining the 5,000 most frequent 
words occurring in each of four text sources: Google 
Books (English), music lyrics, the New York Times and 
Twitter. A total of roughly 10,000 of these individual 
words have been scored by users of Amazon's Mechan- 
ical Turk service on a scale of 1 (sad) to 9 (happy), 
resulting in a measure of average happiness for each given 
word [23]. For example, 'rainbow' is one of the happiest
words in the list with a score of $h_{\mathrm{avg}} = 8.1$, while 'earthquake'
is one of the saddest, with $h_{\mathrm{avg}} = 1.9$. Neutral
words like 'the' or 'thereof' tend to score in the middle
of the scale, with $h_{\mathrm{avg}} = 4.98$ and 5 respectively.

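For a given text $T$ containing $N$ unique words, we calculate
the average happiness $h_{\mathrm{avg}}$ by

$$h_{\mathrm{avg}}(T) = \frac{\sum_{i=1}^{N} h_{\mathrm{avg}}(w_i)\, f_i}{\sum_{i=1}^{N} f_i} = \sum_{i=1}^{N} h_{\mathrm{avg}}(w_i)\, p_i, \qquad (1)$$

where $f_i$ is the frequency of the $i$th word $w_i$ in $T$ for which
we have a happiness value $h_{\mathrm{avg}}(w_i)$, and $p_i = f_i / \sum_{j=1}^{N} f_j$
is the normalized frequency of word $w_i$.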
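
As an illustration of the frequency-weighted average in equation (1), the following minimal Python sketch scores a short text. The dictionary `LABMT_SAMPLE` and the function name `average_happiness` are hypothetical and introduced only for this example; the three scores are the ones quoted above, whereas the real LabMT list contains roughly 10,000 words.

```python
from collections import Counter

# Hypothetical miniature word list; the scores are those quoted in the text.
LABMT_SAMPLE = {"rainbow": 8.1, "earthquake": 1.9, "the": 4.98}


def average_happiness(words, scores):
    """Frequency-weighted mean happiness of the scored words in `words`."""
    # f_i: how often each scored word occurs; unscored words are ignored.
    counts = Counter(w for w in words if w in scores)
    total = sum(counts.values())
    if total == 0:
        return None  # no scored words, so the average is undefined
    return sum(scores[w] * f for w, f in counts.items()) / total


text = "the rainbow after the earthquake".split()
h = average_happiness(text, LABMT_SAMPLE)
```

Here 'after' has no score in the toy list and is skipped, so the average is taken over four scored tokens only.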

Importantly, with this method we make no attempt to 
take the context of words or the meaning of a text into 
account. While this may lead to difficulties in accurately 



determining the emotional content of small texts, we find 
that for sufficiently large texts this approach nonethe- 
less gives reliable (if eventually improvable) results. An 
analogy is that of temperature: while the motion of a 
small number of particles cannot be expected to accu- 
rately characterize the temperature of a room, an average 
over a sufficiently large collection of such particles defines 
a durable quantity. Furthermore, by ignoring the context 
of words we gain both a computational advantage and a 
degree of impartiality; we do not need to decide a pri- 
ori whether a given word has emotional content, thereby 
reducing the number of steps in the algorithm and hope- 
fully reducing experimental bias. 

Following Dodds et al. (2011), for the remainder of
this paper, we remove all words $w_i$ for which the happiness
score falls in the range $4 < h_{\mathrm{avg}}(w_i) < 6$ when
calculating $h_{\mathrm{avg}}(T)$. Removal of these neutral or 'stop'
words has been demonstrated to provide a suitable balance
between sensitivity and robustness for our 'hedonometer'
[11]. Further details on how we preprocessed
the Twitter data set can be found in Appendix A.
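
The exclusion of neutral words can be sketched as a simple filter on the word list. This is only an illustration under stated assumptions: the function name `remove_neutral_words` and the entry `boundary_word` are hypothetical, and the strict inequalities mirror the range $4 < h_{\mathrm{avg}}(w_i) < 6$ given above.

```python
def remove_neutral_words(scores, low=4.0, high=6.0):
    """Keep only words whose score lies outside the open interval (low, high)."""
    return {w: h for w, h in scores.items() if not (low < h < high)}


# Scores for the first four words are those quoted in the text;
# 'boundary_word' is a hypothetical entry used to show the strict inequality.
sample_scores = {
    "rainbow": 8.1,
    "earthquake": 1.9,
    "the": 4.98,
    "thereof": 5.0,
    "boundary_word": 6.0,
}

kept = remove_neutral_words(sample_scores)
```

Only 'the' and 'thereof' are dropped; a word scoring exactly 4 or 6 is retained because the range is open.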

We will correlate our happiness results with census
data taken from the American Community Survey 1-year
estimates for 2011, accessed online at
http://factfinder2.census.gov/.



III. HAPPINESS ACROSS STATES AND 
URBAN AREAS 

We first examine how happiness varies on a somewhat 
coarser scale than we will focus on for the majority of this 
paper, by plotting the average happiness of all states in 
the US in figure 1. To avoid the problem that some states 
have happier names than others (for example, Hawaii), 
we removed each state name from the calculation for $h_{\mathrm{avg}}$.
We remark first that at such a coarse resolution there
is little variation between states, which all lie within
roughly 0.15 of the mean value for the entire United States of
$h_{\mathrm{avg}} = 6.01$. The happiest state is Hawaii with a score
of $h_{\mathrm{avg}} = 6.17$ and the saddest state is Louisiana with
a score of $h_{\mathrm{avg}} = 5.88$. Hawaii emerges as the happiest
state due to an abundance of relatively happy words such 
as 'beach' and food-related words, but also because of the 
presence of the word 'hi'. This is most likely because of 
an increased use of Hawaii's state code 'HI' in geotagged 
tweets, and will somewhat bias the results. However, we 
chose not to remove this word from the data set because 
its use in place of 'hello' will contribute to the happiness 
score in other states, and the rich variety of happy words 
occurring in Hawaii paints a convincing picture of it as a 
happy state regardless of this small bias. A similar result 
showing greater happiness and a relative abundance of 
food-related words in tweets made by users who regular- 
ly travel large distances (as would be the case for many 
of the tweets emanating from Hawaii) has been reported 
in [18]. Louisiana is revealed as the saddest state pri- 
marily as a result of an abundance of profanity relative 

FIG. 1: Choropleth showing average word happiness for geotagged tweets in all US states collected during the calendar year
2011. The happiest 5 states, in order, are: Hawaii, Maine, Nevada, Utah and Vermont. The saddest 5 states, in order, are: 
Louisiana, Mississippi, Maryland, Delaware and Georgia. Word shift plots describing how differences in word usage contribute 
to variation in happiness between states are presented in Appendix B (online). 



to the other states, in stark contrast with the findings of 
Oswald and Wu [25] that Louisiana exhibited the highest 
score on an alternate measure of life satisfaction. 

We can further use this data on word frequencies to 
characterize similarities between states based on word 
usage. Figure 2 shows the linear correlation between 
word frequency vectors $\mathbf{f} = \{f_i : i = 1, \ldots, 50000\}$ for each
pair of states, with red entries in the matrix indicating 
states with similar word use. We see some clusters which 
might be explained by geographical proximity, such as 
Vermont and New Hampshire or Louisiana and Mississip- 
pi, and some outliers such as the state of Nevada, which 
correlates the lowest on average with all other states. 
Additional details on this state-level dataset, including
plots of raw number of tweets and number of tweets 
per head of population for each state can be found in 
Appendix A. Word shift graphs showing which words 
contribute most to the variation in happiness across 
states can be found in Appendix B (online) [1]. 
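
The pairwise comparison of word frequency vectors can be sketched in a few lines of plain Python. The region names and counts below are hypothetical toy data, and the helper names `pearson` and `correlation_matrix` are introduced here for illustration only; the sketch assumes that 'linear correlation' means the Pearson coefficient and that all vectors are indexed by the same word list.

```python
from math import sqrt


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(freqs):
    """Correlation between every pair of frequency vectors, in sorted name order."""
    names = sorted(freqs)
    return names, [[pearson(freqs[a], freqs[b]) for b in names] for a in names]


# Hypothetical word counts for three regions over the same four words.
state_freqs = {
    "A": [10, 5, 1, 0],
    "B": [20, 10, 2, 0],
    "C": [0, 1, 5, 10],
}

names, matrix = correlation_matrix(state_freqs)
```

Regions A and B use the words in identical proportions and so correlate perfectly, while C shows the opposite pattern and correlates negatively with both.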



We now change our resolution to a finer scale by 
focussing on cities rather than states. As an illustra- 
tion of the resolution of the data set as well as our tech- 
nique, we plot a tweet-generated map of a city, showing 
how average word happiness varies with location. In fig- 
ure 3 we plot tweets collected from the New York City 
area during 2011. Each point represents an individual 
tweet, and is colored by the happiness $h_{\mathrm{avg}}$ of the text
$T$ consisting of the $N = 200$ closest LabMT words to
the location of that tweet. We set a maximum threshold 
radius of r = 500 meters around each tweet location; if 
200 LabMT words cannot be found within that radius 
then the point is colored black. Several features can 
immediately be discerned in this purely tweet-generated 
map. Firstly, the spatial resolution reveals the outline 
of Manhattan, as well as Central Park, individual streets 
and bridges, and even airport terminals such as those at 
JFK and Newark airports at the lower right and center 
left of the figure respectively. Secondly, we can discern 

FIG. 2: Clustergram showing cross-correlations between word frequency distributions for all states in 2011. Red signifies
states with similar or highly-correlating word frequency distributions, while blue signifies states with relatively dissimilar word 
frequency distributions. 



regions of higher and lower happiness: the Harlem and 
Washington Heights areas to the north appear relatively 
sad compared to the Downtown/Midtown area, as does
the Waterfront, New Jersey area west of the southern 
tip of Manhattan. Similar tweet-generated maps for all 
373 cities in the data set are presented in Appendix B 
(online) [1]. 
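
The coloring rule for the tweet-generated map can be sketched as follows. This is a brute-force illustration, not the procedure actually used for the figures: the great-circle (haversine) distance, the function names and the toy coordinates are all assumptions made for this example, and a real data set of millions of tweets would need a spatial index.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def local_happiness(center, tweets, scores, n_words=200, radius_m=500.0):
    """Mean score of the n_words scored words nearest to `center`.

    Returns None (the point would be colored black) when fewer than
    n_words scored words lie within radius_m of the center.
    """
    nearby = []
    for lat, lon, words in tweets:
        d = haversine_m(center[0], center[1], lat, lon)
        if d <= radius_m:
            nearby.extend((d, scores[w]) for w in words if w in scores)
    if len(nearby) < n_words:
        return None
    nearby.sort(key=lambda pair: pair[0])  # closest words first
    return sum(h for _, h in nearby[:n_words]) / n_words


# Hypothetical tweets as (latitude, longitude, words).
toy_scores = {"rainbow": 8.1, "earthquake": 1.9}
toy_tweets = [
    (40.000, -74.0, ["rainbow"]),
    (40.001, -74.0, ["earthquake"]),  # roughly 111 m north of the center
    (41.000, -74.0, ["rainbow"]),     # far outside the radius
]

h_local = local_happiness((40.0, -74.0), toy_tweets, toy_scores, n_words=2)
h_sparse = local_happiness((40.0, -74.0), toy_tweets, toy_scores, n_words=3)
```

With only two scored words inside the 500 meter radius, asking for three returns `None`, which corresponds to a black point on the map.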

In figure 4 we show a tweet-generated happiness map 
of the entire contiguous United States, where we have 



now used N = 500 and r = 10 km. We can clearly 
discern cities and the roads between them at this scale, 
and substantial variation in happiness across geograph- 
ical regions. There is already an indication that some 
cities will be significantly less happy than others, partic- 
ularly those in the southeastern United States, a conclu- 
sion which will be made more quantitative later. At a 
finer scale we can see that some coastal areas, particular- 
ly around the Florida peninsula and along the coast of 

FIG. 3: Map of tweets collected from New York City during the calendar year 2011. Each point represents an individual tweet
and is colored by the average word happiness $h_{\mathrm{avg}}$ of nearby tweets: red is happier, blue is sadder. For a point to be colored,
we require that there be at least 100 LabMT words within a 500 meter radius of the location; points which do not satisfy this 
criterion are colored black. 



North and South Carolina, are significantly happier than 
the regions immediately inland of them. We will see this 
again below in the word shifts for various beachside cities. 

Next we calculate the happiness $h_{\mathrm{avg}}$ for each city in
the census data set using equation (1), where the boundaries
of a city are defined by the MAF/TIGER database,
and each text $T$ is formed by agglomerating all the words
falling within that city. Figure 5 shows the distribution
of happiness scores for all cities; as is to be expected
for smaller samples, the range of values is slightly wider
than that calculated for the states, extending over a
range of more than 0.2 from the mean of $h_{\mathrm{avg}} = 6.00$. We
remark that the distribution is skewed: there are more
cities that are happier than the overall average, by 220
to 153.

It is well known that city population sizes follow a pow- 
er law distribution (see [36] and many others), which 
in conjunction with figure 5 suggests that happiness 
decreases with city size. While we did find a slight neg- 
ative correlation between happiness and the number of 
tweets gathered in each city, we in fact found that hap- 
piness strongly negatively correlates with the number of 
tweets per capita, with Spearman correlation coefficient 
-0.558 and p- value less than 10~^^, as shown in figure 



6. This suggests that cities with high technology adop- 
tion rates (as most geotagged tweets come from devices 
like smartphones) are in fact less happy than their less- 
technological counterparts. 

The bar charts in figures 7 and 8 show the average word 
happiness $h_{\mathrm{avg}}$ for the 15 happiest and 15 saddest cities
in the contiguous United States, respectively. Using this 
method we identify Napa, California as the happiest city 
in the US with a score of 6.26, and Beaumont, Texas as 
the saddest city with a score of 5.83. 

Perhaps surprisingly, several cities that ranked both 
highly and lowly by our measure rank similarly in more 
traditional survey based efforts. For example, a Gallup- 
Healthways well-being survey for 2011 [19] showed Boul- 
der, Colorado as the city with the fifth highest well- 
being index composite score (and twelfth highest hap- 
piness score in our list), while Flint, Michigan had the 
second lowest and Montgomery, Alabama the 21st-lowest 
well-being index (compared to 8th lowest and 14th lowest 
happiness scores on our list). The overall Spearman correlation
between the rankings using Gallup's well-being
index and our measure is $\rho = 0.328$, with p-value
7.73 X 10~^ (a scatter plot is presented online in Appendix 
C). Whereas our list uses only word frequencies in the 

FIG. 5: Histogram showing the distribution of happiness 
values for the 373 cities in the census data set. A vertical 
dashed line denotes the average for all cities. Note the greater 
weight towards the right of the distribution, with more cities 
having happiness scores higher than the average. 

FIG. 7: The 15 highest average word happiness scores $h_{\mathrm{avg}}$
for cities in the contiguous USA, as calculated using (1) and 
the LabMT word list. 

FIG. 6: Happiness as a function of number of tweets per
capita. Areas with a higher density of tweets per capita tend 
to be less happy. 



FIG. 8: The 15 lowest average word happiness scores $h_{\mathrm{avg}}$
for cities in the contiguous USA, as calculated using (1) and 
the LabMT word list. 



calculation of $h_{\mathrm{avg}}$, the Gallup-Healthways score is an
average of six indices which measure life evaluation, emo- 
tional health, work environment, physical health, healthy 
behaviors, and access to basic necessities. We remark 
that our method (a) is far more efficient to implement
than a survey-based approach, and (b) provides a near
real-time stream of information quantifying well-being in 
cities. 

To investigate why the average word happiness varies 
across urban areas, we study the word shift graphs [10, 
11] for each city. These graphs show how the difference 
in happiness for two texts depends on differences in the 
underlying word frequencies. In figure 9 we show the 
word shift graphs for Napa and Beaumont, as compared 
to the entire corpus of words collected for all urban areas 
during 2011. Word shift graphs for every city are pre- 
sented in Appendix C (online) [1]. 
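
A word shift can be sketched from the identity $h_{\mathrm{avg}}(T_{\mathrm{comp}}) - h_{\mathrm{avg}}(T_{\mathrm{ref}}) = \sum_i [h_{\mathrm{avg}}(w_i) - h_{\mathrm{avg}}(T_{\mathrm{ref}})][p_i^{\mathrm{comp}} - p_i^{\mathrm{ref}}]$, which follows from equation (1) because the normalized frequencies of each text sum to one. The Python below is an illustration under that assumption; the normalization of each term to a percentage of the total difference follows our reading of [10, 11], the word counts are hypothetical, and the function name `word_shift` is introduced only for this example.

```python
def word_shift(freq_ref, freq_comp, scores):
    """Per-word percentage contributions to the happiness difference.

    freq_ref and freq_comp map words to counts in the reference and
    comparison texts. Returns (difference, ranked list of (word, percent)).
    """
    words = [w for w in scores if freq_ref.get(w, 0) + freq_comp.get(w, 0) > 0]
    n_ref = sum(freq_ref.get(w, 0) for w in words)
    n_comp = sum(freq_comp.get(w, 0) for w in words)
    if n_ref == 0 or n_comp == 0:
        raise ValueError("each text needs at least one scored word")
    p_ref = {w: freq_ref.get(w, 0) / n_ref for w in words}
    p_comp = {w: freq_comp.get(w, 0) / n_comp for w in words}
    h_ref = sum(scores[w] * p_ref[w] for w in words)
    h_comp = sum(scores[w] * p_comp[w] for w in words)
    delta = h_comp - h_ref
    if delta == 0:
        raise ValueError("the two texts have identical average happiness")
    # The sign of (score - h_ref) gives +/-; the sign of the frequency
    # change gives the up or down arrow of the word shift graph.
    shifts = {
        w: 100.0 * (scores[w] - h_ref) * (p_comp[w] - p_ref[w]) / abs(delta)
        for w in words
    }
    ranked = sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return delta, ranked


# Scores as listed for these words in Table II; the counts are hypothetical.
toy_scores = {"love": 8.42, "hate": 2.34}
delta, ranked = word_shift({"love": 10, "hate": 10}, {"love": 15, "hate": 5}, toy_scores)
```

In this toy case the comparison text is happier by 1.52, with half of the difference coming from more frequent use of a happy word and half from less frequent use of a sad word.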



We observe some features of the graphs that are con- 
sistent with geography — for example the word 'beach' 
appears high on the list of words for coastal cities such as 
Santa Cruz, California or Miami, Florida. Overall, the 
main factor driving the relative happiness scores for each 
city appears to be the presence or absence of key words 
such as 'lol', 'haha' and its variants, 'hell', 'love', 'like', 
as well as profanity. 



IV. CORRELATING WORD USAGE WITH 
CENSUS DATA 

The word shifts of figure 9 demonstrate how word 
usage varies with location, as well as the importance of 
studying the individual words that go into the calculation
of averaged quantities such as the word happiness

FIG. 9: Word shift graphs showing how $h_{\mathrm{avg}}$ varies for all US cities measured versus the cities Napa, California (left)
and Beaumont, Texas (right) with highest and lowest $h_{\mathrm{avg}}$ respectively. Words are ranked in order of decreasing percentage
contribution to the overall average happiness difference $\delta h_{\mathrm{avg}}$. The symbols +/− indicate whether a word is relatively happy
or sad compared to $h_{\mathrm{avg}}$ for the entire US (text $T_{\mathrm{ref}}$), while the arrows ↑/↓ indicate whether the word was used more or less
in the text $T_{\mathrm{comp}}$ for each city than in $T_{\mathrm{ref}}$. The left inset panel shows how the ranked LabMT words combine in sum. The
four circles at bottom right show the total contribution of the four kinds of words (+↓, +↑, −↑, −↓). Relative text size is
indicated by the areas of the gray squares.

$h_{\mathrm{avg}}$. We now therefore examine in greater detail how
happiness and word usage relate to underlying social fac- 
tors. 

We first focus on how the average happiness $h_{\mathrm{avg}}$
correlates with different social and economic measures. 
To do this we took data from the American Commu- 
nity Survey 1-year estimates for 2011, specifically tables 
DP02 through DP05, covering selected social characteris- 
tics, economic characteristics, housing characteristics and 
demographic and housing estimates. These tables con- 
tained 508 different categories for all cities, from which 
we removed the categories with data on less than 75% of 
all cities, leaving 432 different categories for correlation 
with happiness. 
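
The coverage filter on census categories can be sketched as below. The table layout (a mapping from category to per-city values, with `None` or a missing key for absent data), the category and city names, and the function name `filter_by_coverage` are hypothetical; only the 75% threshold comes from the text.

```python
def filter_by_coverage(table, n_cities, min_coverage=0.75):
    """Keep categories that have data for at least min_coverage of all cities."""
    kept = {}
    for category, values in table.items():
        present = sum(1 for v in values.values() if v is not None)
        if present / n_cities >= min_coverage:
            kept[category] = values
    return kept


# Hypothetical table for four cities.
toy_table = {
    "attr_a": {"c1": 1.0, "c2": 2.0, "c3": 3.0, "c4": 4.0},   # 100% coverage
    "attr_b": {"c1": 1.0, "c2": None, "c3": 3.0, "c4": 4.0},  # 75% coverage
    "attr_c": {"c1": 1.0, "c2": 2.0},                         # 50% coverage
}

kept = filter_by_coverage(toy_table, n_cities=4)
```

A category with data on exactly 75% of cities is retained, since only those with less than 75% are removed.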

In figure 10 we show the Spearman correlation between 
happiness and each demographic attribute for cities in 
the census data set. Each point in the graph represents 
one of the 432 attributes considered; a table listing each 
demographic and its correlation with happiness is pre- 
sented in Appendix D (online) [1]. The groupings into 
columns were made independently of happiness values, 
by performing distance-based clustering using a hierar- 
chical cluster tree on the table of census attributes for all 
cities. The 8 clusters which were found are not unique
and depend on the distance threshold used; however, they
give some indication of which attributes covary. Only two
groups show a large number of attributes which significantly
correlate (below p = 0.01) with happiness; these
are shown in blue (with red crosses specifying the median
attribute). These two groups might be broadly characterized
as representing high socioeconomic and low socioeconomic
status respectively, with many of the attributes
in the high socioeconomic status group positively corre- 
lating with happiness (and vice versa for the low socioe- 
conomic status group). 
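
To illustrate how the number of clusters depends on the distance threshold, the following sketch performs agglomerative clustering on a small distance matrix. Average linkage and the toy distances are assumptions made for this example (the linkage and distance measure used for the census attributes are not specified above), and the function name `average_linkage_clusters` is hypothetical.

```python
def average_linkage_clusters(dist, threshold):
    """Merge the closest pair of clusters until their distance exceeds threshold.

    dist is a symmetric matrix (list of lists) of pairwise distances; the
    distance between clusters is the mean of all pairwise item distances.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min(
            (avg(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical distances: items 0 and 1 are close, as are items 2 and 3.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]

two_groups = average_linkage_clusters(toy_dist, threshold=0.5)
one_group = average_linkage_clusters(toy_dist, threshold=1.0)
```

Raising the threshold from 0.5 to 1.0 collapses the two groups into one, which is the sense in which the clusters found are not unique.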

To further understand what drives this correlation of 
certain demographics with happiness, we now investigate 
how each word from the LabMT list correlates with all 
attributes from the census. To do this we first normalize 
the word counts in each urban area by the total number 
of tweets collected in each city, and then for each word 
calculate the Spearman correlation $\rho$ between normalized
frequency and census attribute for all cities. For example, 
the scatter plot in figure 11 shows that the normalized 
frequency of occurrence of the word 'cafe' shows a strong 
positive correlation with the percentage of the popula- 
tion with a bachelor's degree or higher. The Spearman 
correlation between the two is $\rho = 0.481$ with p-value
4.90 X 10~^^, indicating strong correlation. 
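
The two steps described above, normalizing a word's count by the number of tweets in each city and then computing a rank correlation with a census attribute, can be sketched in plain Python. All numbers below are hypothetical, the helper names are introduced only for this example, and the sketch computes the coefficient $\rho$ only, not the p-value.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def word_attribute_correlation(word_counts, tweet_totals, attribute):
    """Correlate per-tweet word frequency with an attribute, city by city."""
    normalized = [c / t for c, t in zip(word_counts, tweet_totals)]
    return spearman(normalized, attribute)


# Hypothetical data for four cities: counts of one word, tweets collected,
# and percentage of the population with a bachelor's degree or higher.
rho = word_attribute_correlation(
    word_counts=[5, 20, 2, 30],
    tweet_totals=[1000, 2000, 1000, 2000],
    attribute=[20.0, 30.0, 15.0, 25.0],
)
```

Repeating this calculation for every word and sorting by $\rho$ would give ranked lists of the kind shown in tables I and II.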

Lists showing the correlation of each LabMT word with 
every demographic attribute are presented in Appendix 
D (online) [1]. Taking the percentage of population 
with a bachelor's degree or higher for urban areas from
the 2011 census as a representative example, tables I 
and II show the top 25 words which show the high- 
est positive and negative correlations respectively to this 
attribute. The results show that longer words such as 
'software', 'development' and 'emails' correlate strongly 



with education, while the words which correlate negative- 
ly with education are generally shorter, with no words 
longer than two syllables appearing in the list. Further- 
more, many of the words such as 'love', 'talk' and 'mom' 
appearing in table II are family- or relationship-oriented, 
while the more technical terms appearing in table I are 
more employment-oriented, and suggest more complex 
and abstract intellectual themes. It may be postulated 
that this is a reflection of the social processes occurring 
in urban areas characterized by rates of low and high 
education, respectively. 

Word          $\rho$    p-value            $h_{\mathrm{avg}}$
...           ...       1.57 × 10^-16      6.86
lounge        0.409     1.68 × 10^-16      6.50
market        0.408     2.2 × 10^-16       6.28
india         0.407     2.5 × 10^-16       6.42
drinking      0.405     3.74 × 10^-16      6.14
technology    0.405     3.76 × 10^-16      6.74
forest        0.405     3.83 × 10^-16      6.68
brunch        0.405     3.89 × 10^-16      6.32
dining        0.403     4.92 × 10^-16      6.48
supporting    0.399     1.1 × 10^-15       6.48
professor     0.398     1.23 × 10^-15      6.04
university    0.392     3.62 × 10^-15      6.74
film          0.391     4.27 × 10^-15      6.56
global        0.391     4.72 × 10^-15      6.00

TABLE I: Top 25 words with strongest positive Spearman
correlation $\rho$ to percentage of population with a bachelor's
degree or higher (census table DP02-HC03-VC94) in 2011.
Stop words with $4 < h_{\mathrm{avg}} < 6$ have been removed from the
list. Note the low p-values for all words, indicating strong
statistical significance.

The technique applied here is not limited to only the 
traditional types of data collected through the census. As 
an example of a different use of the data set, we
correlate word use to obesity at the metropolitan level. 
For this study we take obesity levels from the Gallup and 
Healthways 2011 survey [35], and metropolitan areas as 
defined by the U.S. Office of Management and Budget's 
definitions for Metropolitan Statistical Areas (MSAs) 
[31]. We remark that the MSAs are generally two to three 

FIG. 10: Spearman correlations for 432 demographic attributes with happiness. The 8 groupings along the horizontal axis
are for covarying attributes identified by agglomerative hierarchical clustering, independently of happiness. Crosses lie on the 
median of each cluster, and the dashed lines represent the 1% significance level. The two clusters which have medians that 
correlate significantly with happiness are colored blue. A complete list of the correlation of all attributes with happiness can 
be found in Appendix D (online). 



times larger in area than the TIGER urban area census 
boundaries, and the Gallup obesity survey was only for 
the 190 largest-population areas. The obesity data set 
contains fewer small cities than the TIGER census set, 
particularly in the Midwest. We collected more than 10
million tweets from these 190 MSAs, corresponding to 
just over 80 million words during 2011. 

Performing the same analysis as for the attributes in 
figure 10, in figure 12 we show the relationship between 
happiness and obesity for the 190 MSAs included in the 
Gallup survey. We find that happiness generally decreas- 
es as obesity increases, with the third happiest city in 
this set (Boulder, Colorado) corresponding with the low- 
est obesity rate (12.1%) and the saddest city (Beaumont, 
Texas, as found previously) corresponding with the fifth 
highest obesity rate (33.8%). We calculate a Spearman
correlation coefficient of $\rho = -0.339$ with p-value
2.01 X 10~^ for the data, indicating statistically signifi- 



cant negative correlation. 

As previously for the census data, we also correlate 
the abundance of each individual word in the LabMT 
list to obesity levels in the 190 cities surveyed. From this 
list we extract words that are clearly food-related, and 
present those which most strongly negatively and
positively correlate with obesity in table III. Note that
we are including stop words for which $4 < h_{\mathrm{avg}}(w_i) < 6$
in these lists. Coffee-related words such as 'cafe', 'cof- 
fee', 'espresso' and 'bean' feature prominently in the list, 
and many of the words refer to eating at restaurants — 
'sushi', 'restaurant', 'cuisine' and 'brunch', for example. 
As we might expect such words to correlate with wealth, 
this suggests a correlation between obesity and poverty, 
a claim which we note remains contentious in the medical 
literature (for example, supported in [21, 24], and refuted 
in [7]). 

FIG. 11: Scatter plot showing the correlation between rate
of occurrence of the word 'cafe' and percentage of population
with a bachelor's degree or higher in US cities during the
calendar year 2011. The red line shows linear correlation while
the reported $\rho$ and p-values show the Spearman correlation.

FIG. 12: Scatter plot showing correlation between $h_{\mathrm{avg}}$ and
obesity level, as taken from the 2011 Gallup and Healthways
survey. The red line is the straight line of best fit to the data,
while the $\rho$ value is the Spearman correlation coefficient for
the data.

Conversely, only 6 food-related words significantly positively
correlate with obesity with p-values less than 0.05
(note again the asymmetry in the number of words which 
positively and negatively correlate with obesity). The 
fast food chain 'mcdonalds' correlates most strongly, and 
the foods 'wings' and 'ham' both appear. Unlike in 
the low-obesity word table, words describing a desire for 
food — 'eat' and 'hungry' — as well as the negative reaction 
of 'heartburn' to overeating, both appear on the list. In 
Appendix A we show tables listing the food-related words 
which show the least correlation with obesity, as well as 
the top 25 words (food-related or not) from the LabMT 
list that correlate and anti-correlate with obesity. The 
full list of LabMT words and their correlations with obe- 



Word         $\rho$     p-value            $h_{\mathrm{avg}}$
me           -0.393     3.26 × 10^-15      6.58
love         -0.389     6.51 × 10^-15      8.42
my           -0.354     1.97 × 10^-12      6.16
like         -0.346     6.04 × 10^-12      7.22
hate         -0.344     8.76 × 10^-12      2.34
tired        -0.343     1 × 10^-11         3.34
sleep        -0.341     1.27 × 10^-11      7.16
stupid       -0.328     8.55 × 10^-11      2.68
bored        -0.315     5.11 × 10^-10      3.04
you          -0.315     5.23 × 10^-10      6.24
goodnight    -0.305     1.77 × 10^-9       6.58
bitch        -0.295     6.51 × 10^-9       3.14
all          -0.289     1.33 × 10^-8       6.22
lie          -0.285     2.24 × 10^-8       2.60
mom          -0.284     2.42 × 10^-8       7.64
wish         -0.271     1.05 × 10^-7       6.92
talk         -0.267     1.74 × 10^-7       6.06
she          -0.265     2.01 × 10^-7       6.18
know         -0.262     2.78 × 10^-7       6.10
ill          -0.259     4.11 × 10^-7       2.42
dont         -0.258     4.54 × 10^-7       3.70
well         -0.256     5.3 × 10^-7        6.68
don't        -0.255     5.8 × 10^-7        3.70
give         -0.255     5.84 × 10^-7       6.54
friend       -0.255     6.27 × 10^-7       7.66

TABLE II: Top 25 words with strongest negative Spearman
correlation $\rho$ to percentage of population with a bachelor's
degree or higher in 2011 (with stop words removed).



sity can be found in Appendix E (online) [1]. 

The above analysis demonstrates that different cities 
have unique characteristics. We now ask whether cities 
can be sorted into groups solely based upon similarities in 
their word distributions. Bettencourt et al. [6] used data
on the economy, crime and innovation to characterize 
cities; here we use a similar methodology except with 
word frequency data to uncover so-called 'kindred' cities. 

We group the top 40 cities with highest word counts in 
2011 by calculating the linear correlation between word 
frequency vectors f as we did in figure 2. The resulting
cross-correlation matrix is shown in figure 13, with red
signifying strong correlation between cities. First, we
note that all cities show similar word frequency distributions,
with all correlations being higher than ρ = 0.8. As
was the case for the states (see figure 2), we see one clear 
large group of strongly correlated cities emerge in the 
lower right corner, with a smaller distinct cluster appear- 
ing at the top left. Perhaps uniquely, these groupings are 
defined solely by similarities in word usage between cities, 
rather than by geography or economic indicators. 
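
As a rough illustration of the cross-correlation step described above, the following Python sketch computes the linear (Pearson) correlation between every pair of word frequency vectors. The city names and frequency values are hypothetical toy data, and the helper names `pearson` and `correlation_matrix` are introduced here for illustration only; the sketch assumes every city's vector is defined over the same ordered word list.

```python
import math

def pearson(x, y):
    """Linear (Pearson) correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(freqs):
    """freqs maps a city name to its word frequency vector.

    Returns the sorted city names and the matrix of pairwise correlations.
    """
    cities = sorted(freqs)
    matrix = [[pearson(freqs[a], freqs[b]) for b in cities] for a in cities]
    return cities, matrix

# Hypothetical toy data: four words per city, not taken from the paper.
toy = {
    "city_a": [0.40, 0.30, 0.20, 0.10],
    "city_b": [0.38, 0.32, 0.18, 0.12],
    "city_c": [0.10, 0.20, 0.30, 0.40],
}
cities, corr = correlation_matrix(toy)
for name, row in zip(cities, corr):
    print(name, [round(r, 3) for r in row])
```

With real data each vector would hold one entry per word in the shared list, so the matrix for 40 cities has 40 × 40 entries and is symmetric with ones on the diagonal.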

Word         ρ         p-value          h_avg
cafe         -0.509    6.07 × 10^-14    6.78
sushi        -0.487    9.93 × 10^-13    5.40
brewery      -0.469    8.67 × 10^-12    N/A
restaurant   -0.448    8.93 × 10^-11    7.06
bar          -0.435    3.59 × 10^-10    5.82
banana       -0.434    3.77 × 10^-10    6.86
apple        -0.408    5.22 × 10^-9     7.44
fondue       -0.403    8.34 × 10^-9     N/A
wine         -0.400    1.08 × 10^-8     6.42
delicious    -0.392    2.17 × 10^-8     7.92
dinner       -0.386    3.85 × 10^-8     7.40
coffee       -0.384    4.51 × 10^-8     7.18
bakery       -0.383    5.12 × 10^-8     N/A
bean         -0.378    7.88 × 10^-8     5.80
espresso     -0.377    8.47 × 10^-8     N/A
cuisine      -0.376    8.82 × 10^-8     N/A
foods        -0.374    1.07 × 10^-7     7.26
tofu         -0.372    1.27 × 10^-7     N/A
brunch       -0.368    1.79 × 10^-7     6.32
veggie       -0.364    2.46 × 10^-7     N/A
organic      -0.361    3.13 × 10^-7     6.32
booze        -0.360    3.34 × 10^-7     N/A
grill        -0.354    5.4 × 10^-7      6.24
chocolate    -0.351    6.77 × 10^-7     7.86
#vegan       -0.350    7.47 × 10^-7     N/A

mcdonalds    0.246     6.18 × 10^-4     5.98
eat          0.241     8.22 × 10^-4     7.04
wings        0.222     2.13 × 10^-3     6.52
hungry       0.210     3.65 × 10^-3     3.38
heartburn    0.194     7.37 × 10^-3     N/A
ham          0.177     1.45 × 10^-2     5.66

TABLE III: The top 25 food-related words only with strongest
negative correlation to obesity level (top), and the 6 food-
related words with positive correlation to obesity level and
p-value less than 0.05 (bottom).

We cluster cities using an agglomerative hierarchical
method with average linkage clustering, as shown in the
dendrogram at the top of figure 13, and highlight the 4
clusters with lowest linkage threshold using different colors.
As one might expect, some cities that are geographically
nearby are grouped together. Notable examples are
some cities in the southern US such as Baton Rouge, New
Orleans and Memphis in the lower right of the plot, as
well as the Californian cities of San Diego and San Francisco
at top left. However, this pattern does not hold for
all cities; while there is the suggestion of a north/south
grouping between the two clusters at the top left and
the two at the bottom right, some cities such as Austin
and Tampa in the south and Detroit and Philadelphia
in the north go against this trend. The cities of Cleveland
and Detroit are the most alike in word use, having
a cross-correlation of ρ = 0.995, while Austin and Baton
Rouge are the most dissimilar with a cross-correlation of
ρ = 0.813. Indianapolis is the city with highest average
correlation to the word use in other cities (ρ = 0.961),
while Minneapolis shows the most unique word use on
average, with ρ = 0.884.
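
The clustering step described above can be sketched in a few lines of Python. This is a minimal illustration of average-linkage agglomerative clustering, not the implementation used for figure 13: the conversion of correlation to distance as 1 − ρ is an assumption made here, the function name `average_linkage` is hypothetical, and the four cities "A" to "D" and their correlation values are invented toy data.

```python
def average_linkage(names, dist, n_clusters):
    """Agglomerative clustering with average linkage.

    Repeatedly merges the two clusters whose mean pairwise distance is
    smallest, until n_clusters remain. dist is a symmetric matrix.
    """
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]

# Hypothetical correlation matrix for four cities (toy values).
names = ["A", "B", "C", "D"]
corr = [
    [1.00, 0.99, 0.85, 0.84],
    [0.99, 1.00, 0.86, 0.83],
    [0.85, 0.86, 1.00, 0.98],
    [0.84, 0.83, 0.98, 1.00],
]
# Assumption: distance is taken as 1 - correlation.
dist = [[1.0 - r for r in row] for row in corr]
groups = average_linkage(names, dist, n_clusters=2)
print(groups)
```

Recording the distance at which each merge happens, rather than stopping at a fixed number of clusters, would give the heights needed to draw a dendrogram.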



V. DISCUSSION 

In this paper we have examined word use in urban 
areas in the United States, using a simple mathematical 
method which has been shown to have great flexibili- 
ty, sensitivity and robustness. We have used this tool to 
map areas of high and low happiness and score individual 
cities for average word happiness. In order to understand 
in greater detail how word usage influences happiness, we 
used both word shift graphs to find the words which produced
the most difference between the happiness scores of
each city and the average for the entire US, and socioeco- 
nomic census data to attempt to explain the usage of cer- 
tain words. A significant driver of the happiness score for 
individual cities was found to be frequency of swear word 
use; we believe that future studies of regional variation 
in swear word use or 'geoprofanity' could help explain 
geographical differences in happiness. Indeed, swearing 
has previously been found to be a predictor of large-scale 
protests and social uprisings in Iran [15]. 

Happiness within the US was found to correlate strong- 
ly with wealth, showing largest positive correlation with 
household income and strongest negative correlation with 
poverty amongst the census data sets used. This is con- 
sistent with the first part of the 'Easterlin paradox' [13], 
that within countries at a given time happiness consis- 
tently increases with income. The second part of the 
paradox is that while personal wealth has been observed 
to consistently increase over time, happiness has tended 
to decrease in both developed and developing countries 
[13, 14]. A previous result using this method showing a 
decline in happiness over the 2009-2011 period (see figure 3
of [11]) is consistent with this finding. The relationship
between wealth and happiness is still highly debated;
recent work by Stevenson and Wolfers [32] claims to show
a direct correlation between gross domestic product and
subjective well-being across countries, while Di Tella and
MacCulloch [8] in the same year argue that the Easterlin
paradox is in fact exacerbated if economic variables other
than income are considered.

Interestingly, happiness was also observed to anticor- 
relate significantly with obesity. A similar link between 
obesity and happiness has previously been reported [17], 
particularly for individuals who report low self control 
[33]. However, as some authors point out, the presence of 
chronic illnesses accompanying obesity can confound the 
link between obesity and psychological well-being [12], 
and indeed an inverse relationship between weight and 
depression has been found in some studies [26]. We 
remark that it should be possible to use techniques such
as those described here to mine social network data for
real-time surveying. For example, the potential for identifying
areas with high obesity based solely on word use
is significant.

FIG. 13: Cross-correlations between word frequency distribution differences for the 40 cities with highest word counts. Red
signifies cities with similar word frequency distribution, while blue signifies cities with dissimilar word frequency distributions.

There are a number of legitimate concerns to be raised 
about how well the Twitter data set can be said to represent
the happiness of the greater population. Only 15%
of online adults regularly use Twitter, and 18-29 year-olds
and minorities tend to be more highly represented
on Twitter than in the general population [30]. Furthermore,
the fact that we collected only around 10% of all
tweets during the calendar year 2011 means that our data
set is a non-uniform subsample of statements made by a
non-representative portion of the population.

In this work we have only scratched the surface of what 
is possible using this particular dataset. In particular, we 
have not examined whether or not these methods have 
any predictive power — future research could look at how 
observed changes in the Twitter data set, as measured 
using the hedonometer algorithm, predict changes in the 
underlying social and economic characteristics measured 
using traditional census methods. We also plan
to revisit this study when census data for 2012 become
available, to investigate how changes in demographics
across urban areas are reflected in happiness as measured
by word use.

Appendices

Appendix A: Data set and states 

In figure A1 we show the relationship between perimeter
and area for the 3592 cities in the MAF/TIGER
database, which follow an approximate power law
with exponent 1.294. The smallest city in both area and
perimeter is Richmond, California, while the largest city
is New York, whose perimeter extends far north into Connecticut
and is agglomerated with Newark, New Jersey
in this data set. We find that city area shows an approximate
power-law dependence upon perimeter, with an
average fractal dimension of a = 1.294. Similar results
have of course been reported previously for cities [29, 34],
and have even been found to compare well with the fractal
dimension of malignant skin lesions [20].

FIG. A1: Approximate power law relationship between city
area and perimeter for all 3592 cities in the census data set.
The fractal dimension is approximately 1.294.
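
One common way to estimate a power-law exponent of this kind is an ordinary least-squares fit in log-log space. The Python sketch below illustrates that approach under stated assumptions: the text does not say which fitting procedure was used, the function name `fit_power_law` is hypothetical, and the perimeter and area values are synthetic numbers generated from an exact power law so that the fit can be checked.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of log y = log c + k log x; returns (k, c)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope /= sum((a - mx) ** 2 for a in lx)
    prefactor = math.exp(my - slope * mx)
    return slope, prefactor

# Synthetic data (not from the paper): area = 2 * perimeter ** 1.294.
perimeters = [10.0, 20.0, 40.0, 80.0, 160.0]
areas = [2.0 * p ** 1.294 for p in perimeters]
exponent, prefactor = fit_power_law(perimeters, areas)
print(round(exponent, 3), round(prefactor, 3))
```

On real city boundaries the points scatter around the fitted line, so the recovered exponent is only approximate and depends on the range of city sizes included.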

In preprocessing the Twitter data set we have attempt- 
ed to remove tweets from users that are clearly auto- 
mated bots, in particular tweets from weather-recording 
services which periodically report values of temperature, 
humidity and the like. Users for whom more than 15% 
of their tweets contained the words 'humid', 'humidity', 
'pressure' or 'earthquake' were removed from the dataset. 
We also made the decision to remove all variants of the
racial pejorative or 'N-word' from calculations of h_avg.
Variants of this word have very low happiness values,
averaging h_avg = 2.92, and consequently were found to
be highly influential in determining the average city happiness.
However, when examining individual tweets we
found that this word appeared to be used in conversation
as a more colloquial stand-in for the word 'friend'
in the vast majority of cases, and not in fact in any particularly
negative sense. As such, we decided that scoring
of the word was unfairly biasing our results towards the
negative and removed it. Future work
will investigate the scoring of phrases instead of words,
which will reduce the need for this type of adjustment.
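
The bot-removal rule described earlier in this appendix (dropping users for whom more than 15% of tweets contain 'humid', 'humidity', 'pressure' or 'earthquake') could be sketched as follows. The tokenization by lowercase whole words, the function names, and the example users are assumptions made for illustration; the text does not specify how words were matched.

```python
import re

WEATHER_WORDS = {"humid", "humidity", "pressure", "earthquake"}
THRESHOLD = 0.15  # fraction of a user's tweets, as stated in the text

def is_weather_bot(tweets):
    """True if more than 15% of a user's tweets contain a flagged word."""
    if not tweets:
        return False
    flagged = sum(
        1 for t in tweets
        if WEATHER_WORDS & set(re.findall(r"[a-z']+", t.lower()))
    )
    return flagged / len(tweets) > THRESHOLD

def filter_users(tweets_by_user):
    """Keep only users that are not flagged as weather bots."""
    return {u: ts for u, ts in tweets_by_user.items() if not is_weather_bot(ts)}

# Hypothetical example users.
users = {
    "station_x": [
        "Temp 71F, humidity 40%, pressure 1012 mb",
        "Temp 70F, humidity 42%",
        "Wind 5 mph from the north",
    ],
    "person_y": [
        "so humid today", "love this park", "coffee time", "goodnight all",
        "brunch with mom", "new bike", "sunset at the beach",
    ],
}
kept = filter_users(users)
print(sorted(kept))
```

Here the second user mentions 'humid' once in seven tweets, which is below the 15% threshold, so only the automated account is dropped.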

For each city we create the normalized word frequency
distribution f(i) = f_i/n, where n is the total number of
tweets collected for that city. The sum Σ_i f_i/n therefore
represents the average number of LabMT words per
tweet, the mean of which is approximately 7.1. In figure
A2 we show the average tweet length for the US cities for
which we have collected more than 50000 words throughout
2011. Average tweet lengths range from 6.1 LabMT
words per tweet for Orlando, Florida up to 7.8 words in
Seattle, Washington.

FIG. A2: Average message length for US cities with more
than 50000 LabMT words collected during 2011.
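
The normalization above can be illustrated with a short Python sketch. It counts occurrences of words from a word list, divides by the number of tweets n, and sums the result to obtain the average number of listed words per tweet. The four-word list, the example tweets, the whitespace tokenization, and the function name `normalized_frequencies` are all assumptions for illustration; the real LabMT list is far larger.

```python
from collections import Counter

def normalized_frequencies(tweets, word_list):
    """Return f(i) = f_i / n for each listed word i, with n the tweet count."""
    counts = Counter(
        w for t in tweets for w in t.lower().split() if w in word_list
    )
    n = len(tweets)
    return {w: c / n for w, c in counts.items()}

# Hypothetical subset of a word list and toy tweets.
labmt_subset = {"love", "coffee", "hate", "monday"}
tweets = ["Love coffee", "hate monday", "love love coffee today"]

f = normalized_frequencies(tweets, labmt_subset)
avg_words_per_tweet = sum(f.values())
print(f, avg_words_per_tweet)
```

Because the normalization is by tweets rather than by total words, the entries of f do not sum to one; their sum is the average message length in listed words.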

Figure A3 shows choropleths for the number of geo- 
tagged tweets collected (left) and number of geotagged 
tweets normalized by state population (right) for the 2011 
data set. In both plots the color scale is logarithmic. In 
table A1 we show the complete list of happiness scores
for all US states. Word shift plots for each state are 
presented in Appendix B (online). 

In tables A2 and A3 we show lists of the top 25 LabMT 
words with highest positive and negative correlation to 
obesity, respectively. In table A4 we show the words 
with lowest correlation to obesity, that is, the words with 
p-values greater than 0.9. Complete lists for word
correlations with all demographic attributes can be found
in Appendix D (online) [1]. 
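
For readers who want to reproduce correlations of the kind listed in these tables, the sketch below computes a Spearman rank correlation in plain Python by ranking both variables (ties receive the mean of their positions) and taking the Pearson correlation of the ranks. The word frequencies and obesity percentages are hypothetical values for five imaginary cities, and the function names are introduced here for illustration; p-values are not computed.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: frequency of one word and obesity rate (%) per city.
word_freq = [0.009, 0.007, 0.004, 0.003, 0.001]
obesity = [18.0, 22.5, 24.0, 29.0, 33.5]
rho = spearman(word_freq, obesity)
print(round(rho, 3))
```

Because only ranks enter the calculation, the result is insensitive to the heavy-tailed distribution of word frequencies across cities.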

Appendices B, C, D, E, F: Online appendices

The remaining appendices are located online, at
http://www.uvm.edu/storylab/share/papers/mitchell2013a/.
Appendix B contains word shift plots for all states,
Appendix C contains a comparison between happiness and
the Gallup-Healthways well-being measure as well as tweet
maps and word shifts for all cities, and Appendix D contains
complete tables of correlations between demographic
attributes and both happiness and word usage. Appendix E
contains the complete list of LabMT words ordered by
correlation with happiness, and Appendix F is a
daily-updating happiness map of the United States.

FIG. A3: Choropleths showing (the base-10 logarithm of) raw count (left) and number of geotagged tweets collected per capita
(right) in each US state during the calendar year 2011.

Rank   State                   h_avg
1      Hawaii                  6.17
2      Maine                   6.14
3      Nevada                  6.12
4      Utah                    6.11
5      Vermont                 6.11
6      Colorado                6.10
7      Idaho                   6.10
8      New Hampshire           6.09
9      Washington              6.08
10     Wyoming                 6.08
11     Minnesota               6.07
12     Arizona                 6.07
13     California              6.07
14     Florida                 6.06
15     New York                6.06
16     New Mexico              6.05
17     Iowa                    6.05
18     Oregon                  6.05
19     North Dakota            6.04
20     Nebraska                6.04
21     Wisconsin               6.03
22     Kansas                  6.03
23     Alaska                  6.02
24     Oklahoma                6.02
25     Massachusetts           6.02
26     Montana                 6.01
27     Missouri                6.01
28     Kentucky                6.00
29     New Jersey              5.99
30     West Virginia           5.99
31     Illinois                5.99
32     Rhode Island            5.99
33     Indiana                 5.98
34     Texas                   5.98
35     South Dakota            5.98
36     Virginia                5.97
37     Tennessee               5.97
38     Connecticut             5.97
39     Pennsylvania            5.97
40     South Carolina          5.96
41     North Carolina          5.96
42     Ohio                    5.96
43     Arkansas                5.95
44     District of Columbia    5.94
45     Michigan                5.94
46     Alabama                 5.94
47     Georgia                 5.94
48     Delaware                5.92
49     Maryland                5.90
50     Mississippi             5.89
51     Louisiana               5.88

TABLE A1: Happiness scores h_avg for each US state, in order
from highest to lowest.

Word         ρ         p-value          h_avg
don't        0.461     2.28 × 10^-11    3.70
give         0.443     1.57 × 10^-10    6.54
lie          0.442     1.68 × 10^-10    2.60
hell         0.438     2.56 × 10^-10    2.22
my           0.438     2.74 × 10^-10    6.16
she          0.433     4.36 × 10^-10    6.18
okay         0.423     1.18 × 10^-9     6.56
like         0.419     1.72 × 10^-9     7.22
girl         0.419     1.76 × 10^-9     7.00
know         0.415     2.54 × 10^-9     6.10
act          0.412     3.48 × 10^-9     6.00
bitch        0.411     4.01 × 10^-9     3.14
me           0.403     8.5 × 10^-9      6.58
all          0.400     1.08 × 10^-8     6.22
nothin       0.399     1.14 × 10^-8     3.64
better       0.398     1.34 × 10^-8     7.00
bored        0.396     1.5 × 10^-8      3.04
bed          0.395     1.72 × 10^-8     7.18
sleep        0.395     1.78 × 10^-8     7.16
wish         0.388     3.25 × 10^-8     6.92
never        0.387     3.43 × 10^-8     3.34
money        0.380     6.41 × 10^-8     7.30
hate         0.378     7.57 × 10^-8     2.34
make         0.376     9.32 × 10^-8     6.00
cant         0.376     9.33 × 10^-8     3.48

TABLE A2: Top 25 words with strongest positive Spearman
correlation ρ to obesity in 2011. Stop words with 4 < h_avg < 6
have been removed from the list.

Word         ρ         p-value          h_avg
cafe         -0.509    6.07 × 10^-14    6.78
photo        -0.493    4.87 × 10^-13    6.88
thai         -0.476    3.69 × 10^-12    6.22
fitness      -0.472    5.92 × 10^-12    6.92
park         -0.468    9.59 × 10^-12    7.08
yoga         -0.448    8.82 × 10^-11    7.04
restaurant   -0.448    8.93 × 10^-11    7.06
banana       -0.434    3.77 × 10^-10    6.86
event        -0.433    4.54 × 10^-10    6.12
hotel        -0.429    6.41 × 10^-10    6.16
spa          -0.420    1.54 × 10^-9     6.92
interesting  -0.420    1.62 × 10^-9     7.52
design       -0.409    4.76 × 10^-9     6.32
apple        -0.408    5.22 × 10^-9     7.44
feliz        -0.406    6.47 × 10^-9     6.04
photos       -0.404    7.8 × 10^-9      6.94
wine         -0.400    1.08 × 10^-8     6.42
bike         -0.399    1.22 × 10^-8     6.72
sun          -0.398    1.31 × 10^-8     7.80
delicious    -0.392    2.17 × 10^-8     7.92
flight       -0.391    2.34 × 10^-8     6.06
sunset       -0.391    2.51 × 10^-8     7.16
lounge       -0.389    2.93 × 10^-8     6.50
mortgage     -0.386    3.83 × 10^-8     3.88
dinner       -0.386    3.85 × 10^-8     7.40

TABLE A3: Top 25 words with strongest negative Spearman
correlation ρ to obesity in 2011. Stop words with 4 < h_avg < 6
have been removed from the list.

Word          ρ         p-value         h_avg
olive         -0.001    9.94 × 10^-1    6.00
refrigerator  0.001     9.9 × 10^-1     N/A
hashbrowns    0.002     9.83 × 10^-1    N/A
eatting       -0.002    9.76 × 10^-1    N/A
sauteed       0.003     9.72 × 10^-1    N/A
fritos        -0.003    9.69 × 10^-1    N/A
munch         0.003     9.64 × 10^-1    N/A
doughnuts     -0.003    9.62 × 10^-1    N/A
cola          -0.004    9.62 × 10^-1    N/A
okra          -0.004    9.59 × 10^-1    N/A
grapes        0.004     9.51 × 10^-1    N/A
noodles       -0.004    9.51 × 10^-1    N/A
quiznos       0.005     9.49 × 10^-1    N/A
cucumbers     0.005     9.46 × 10^-1    N/A
chow          0.006     9.3 × 10^-1     N/A
walnut        0.007     9.28 × 10^-1    N/A
mulberry      0.007     9.19 × 10^-1    N/A
muesli        0.008     9.17 × 10^-1    N/A
hershey's     0.008     9.17 × 10^-1    N/A
snickers      0.008     9.16 × 10^-1    N/A
krispy        -0.008    9.15 × 10^-1    N/A
nugget        -0.008    9.12 × 10^-1    N/A
smores        0.008     9.1 × 10^-1     N/A
popcorn       0.009     9.07 × 10^-1    6.76

TABLE A4: The 24 food-related words which show least
correlation with obesity, and have p-values greater than 0.9.
Words are arranged in decreasing order of p-value.



